@fvaleri fvaleri commented Nov 21, 2025

This patch fixes a tight broker-to-controller reconnection loop that can occur during shutdown.

  1. Nodes 1 and 2 (brokers) request a controlled shutdown
  2. The controller grants the shutdown
  3. The controller itself shuts down (RaftManager shutdown)
  4. Nodes 1 and 2 keep trying to send heartbeats to the now-dead controller
  5. They get stuck in a reconnection loop because the NodeToControllerRequestThread is still running and has not been shut down properly

The reconnection loop lasts for 5 minutes, which is the shutdown timeout hard-coded in the KafkaBroker trait.
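To make the failure mode concrete, here is a minimal sketch of the loop described above. It is not the Kafka code: the class, the method, and the 50 ms backoff are hypothetical, and time is simulated rather than measured. Only the 5-minute timeout comes from the description. It compares a retry loop that ignores a shutdown request with one that checks for it before each attempt.

```java
// Illustrative simulation only; not the NodeToControllerRequestThread implementation.
final class ReconnectLoopSketch {

    /**
     * Counts connection attempts made against a controller that never answers.
     *
     * @param timeoutMs     overall shutdown timeout (5 minutes in the description)
     * @param backoffMs     hypothetical delay between attempts (assumed value)
     * @param shutdownAtMs  simulated time at which shutdown is requested
     * @param honorShutdown whether the loop checks the shutdown request
     */
    static int simulate(long timeoutMs, long backoffMs, long shutdownAtMs, boolean honorShutdown) {
        long nowMs = 0;   // simulated clock, so the sketch runs instantly
        int attempts = 0;
        while (nowMs < timeoutMs) {
            boolean shutdownRequested = nowMs >= shutdownAtMs;
            if (honorShutdown && shutdownRequested) {
                break;    // stop retrying once shutdown has been requested
            }
            attempts++;   // the controller is gone, so every attempt fails
            nowMs += backoffMs;
        }
        return attempts;
    }

    public static void main(String[] args) {
        long fiveMinutesMs = 5 * 60 * 1000L;
        System.out.println("ignoring shutdown: " + simulate(fiveMinutesMs, 50, 100, false));
        System.out.println("honoring shutdown: " + simulate(fiveMinutesMs, 50, 100, true));
    }
}
```

Under these assumed numbers the loop that ignores shutdown keeps retrying until the full timeout expires, while the one that checks the flag stops after the first couple of attempts.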

This is what I see in the logs from another test run for one of the brokers:

  • SIGTERM received: 14:39:46,282
  • Actual shutdown completed: 14:44:46,385
  • Time elapsed: 5 minutes and 0.103 seconds (approximately 5 minutes)
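The elapsed time can be checked directly from the two timestamps. The snippet below is a small sketch using `java.time`. The class and method names are hypothetical, and it assumes both timestamps fall on the same day and use a comma as the millisecond separator, as in the log excerpt.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Illustrative helper; the names are not part of Kafka.
final class ShutdownElapsed {

    // Matches timestamps such as 14:39:46,282 (the comma is a literal in the pattern).
    private static final DateTimeFormatter LOG_TIME = DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    /** Returns the time between two same-day log timestamps. */
    static Duration elapsed(String start, String end) {
        return Duration.between(LocalTime.parse(start, LOG_TIME), LocalTime.parse(end, LOG_TIME));
    }

    public static void main(String[] args) {
        Duration d = elapsed("14:39:46,282", "14:44:46,385");
        // Prints the duration in milliseconds: 300103
        System.out.println(d.toMillis());
    }
}
```

300103 ms is 5 minutes and 0.103 seconds, which matches the figure above and lines up with the 5-minute shutdown timeout.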

I acknowledge that this is unlikely to happen when brokers run on different machines, but it is much more plausible when running tests locally on a single physical machine.

@github-actions bot added the triage (PRs from the community), core (Kafka Broker), and small (Small PRs) labels on Nov 21, 2025
@fvaleri fvaleri marked this pull request as draft November 21, 2025 15:36

Signed-off-by: Federico Valeri <[email protected]>
@fvaleri fvaleri closed this Nov 25, 2025
@fvaleri fvaleri deleted the fix-shutdown-loop branch November 25, 2025 17:41